
The Annals of Applied Statistics

Institute of Mathematical Statistics

Preprints posted in the last 90 days, ranked by how well they match The Annals of Applied Statistics's content profile, based on 15 papers previously published here. The average preprint has a 0.00% match score for this journal, so anything above that is already an above-average fit.

1
Generative AI-assisted Bayesian-frequentist Hybrid Inference in Single-cell RNA Sequencing Analysis for Genes Associated with Alzheimer's Disease

Han, G.; Yuan, A.; Oware, K. D.; Wright, F.; Carroll, R. J.; Smith, M.; Ory, M. G.; Yan, D.; Wang, W.; Sun, Z.; Dai, Q.; Allen, C.; Dang, A.; Liu, Y.

2026-04-20 geriatric medicine 10.64898/2026.04.17.26351142 medRxiv
Top 0.1%
6.9%

Alzheimer's disease genomics and other high-dimensional omics studies demand powerful statistical methods, yet Bayesian inference remains underutilized despite its advantages in small-sample settings, owing to the prohibitive cost of eliciting reliable priors across thousands or millions of parameters. We propose an AI-assisted Bayesian-frequentist hybrid inference framework that couples large language model-based prior elicitation with the hybrid inference theory of Yuan (2009). ChatGPT-4o is queried via a standardized prompt to assess the strength of evidence linking each gene to a disease of interest, and the response is mapped to an informative normal prior via a standardized effect-size calibration. Parameters for covariates of secondary interest are treated as frequentist parameters, preserving efficiency and avoiding sensitivity to mis-specified priors. We derive closed-form hybrid estimators under uniform and conjugate normal priors in linear models, establish their asymptotic equivalence to the frequentist and full Bayes estimators, and show in simulations that hybrid inference using unconditional variance estimation leads to high statistical power while accurately controlling the Type I error rate. Applied to single-cell RNA sequencing data from the ROSMAP cohort for Alzheimer's disease as an example, the framework identifies biologically coherent pathways (such as gamma-secretase pathways) previously undetected. The proposed framework offers a principled and computationally scalable approach to genome-wide Bayesian analysis, with potential for broad application across omics platforms and disease settings.

2
The Rayleigh Quotient and Contrastive Principal Component Analysis II

Jackson, K. C.; Carilli, M. T.; Pachter, L.

2026-04-10 bioinformatics 10.64898/2026.04.08.717236 medRxiv
Top 0.1%
3.7%

Contrastive principal component analysis (PCA) methods are effective approaches to dimensionality reduction where variance of a target dataset is maximized while variance of a background dataset is minimized. We previously described how contrastive PCA problems can be written as solutions to generalized eigenvalue problems that maximize particular instantiations of the Rayleigh quotient. Here, we discuss two extensions of contrastive PCA: we use kernel weighting from spatial PCA (k-ρPCA) to contrast spatial and non-spatial axes of variation, and separately solve the Rayleigh quotient in the space of basis function coefficients (f-ρPCA) to find modes of variation in functional data. Together, these extensions expand the scope of contrastive PCA while unifying disparate fields of spatial and functional methods within a single conceptual and mathematical framework. We showcase the utility of these extensions with several examples drawn from genomics, analyzing gene expression in cancer and immune response to vaccination.
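
As a rough illustration of the generalized eigenvalue formulation mentioned in this abstract, the sketch below finds directions w that maximize the Rayleigh quotient (w'Aw)/(w'Bw), with A the target covariance and B the background covariance. It is a generic textbook construction, not the authors' implementation; the function name `rayleigh_contrastive_directions`, the small ridge term, and the toy data are all assumptions made for the example.

```python
import numpy as np

def rayleigh_contrastive_directions(target, background, n_components=1, ridge=1e-6):
    """Directions maximizing w'Aw / w'Bw (A: target cov, B: background cov)."""
    A = np.cov(target, rowvar=False)
    B = np.cov(background, rowvar=False)
    B = B + ridge * np.eye(B.shape[0])  # assumed ridge so that B is positive definite
    # Whiten with the Cholesky factor of B (B = L L^T); the generalized problem
    # A w = lambda B w becomes an ordinary symmetric eigenproblem in v = L^T w.
    L = np.linalg.cholesky(B)
    L_inv = np.linalg.inv(L)
    M = L_inv @ A @ L_inv.T
    evals, evecs = np.linalg.eigh(M)
    order = np.argsort(evals)[::-1][:n_components]
    W = L_inv.T @ evecs[:, order]       # map back to the original coordinates
    W = W / np.linalg.norm(W, axis=0)   # unit-length directions
    return evals[order], W

# Toy data: axis 0 is high-variance in both datasets, axis 1 only in the target.
rng = np.random.default_rng(0)
background = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 1.0])
target = rng.normal(size=(500, 3)) * np.array([3.0, 2.0, 1.0])
values, directions = rayleigh_contrastive_directions(target, background)
print(values, directions.ravel())
```

In this toy setup ordinary PCA on the target would favor axis 0, whereas the quotient favors axis 1, the axis whose variance is enriched relative to the background.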

3
Mediation analysis in longitudinal data: an unbiased estimator for cumulative indirect effect

Li, Y.; Cabral, H.; Tripodis, Y.; Ma, J.; Levy, D.; Joehanes, R.; Liu, C.; Lee, J.

2026-04-20 epidemiology 10.64898/2026.04.18.26351189 medRxiv
Top 0.1%
1.8%

Mediation analysis quantifies how an exposure affects an outcome through an intermediate variable. We extend mediation analysis to capture the cumulative effects of longitudinal predictors on longitudinal outcomes. Our proposed model examines how mediators transmit the effects of the current and previous exposure on the current outcome. We construct a least-squares estimator for the cumulative indirect effect (CIE) and use three approaches (exact form, delta method, and bootstrap procedure) to estimate its standard error (SE). The estimator of the CIE is unbiased when there is no unmeasured confounding and the errors of the mediator and outcome models are independent at all time points, as shown theoretically and in simulations. While the three SE estimates are numerically similar, the bootstrap procedure is recommended for its simplicity of implementation. We apply this method to the Framingham Heart Study Offspring cohort to assess whether DNA methylation mediates the association of alcohol consumption with systolic blood pressure over two time points. We identify two CpGs (cg05130679 and cg05465916) as mediators and construct a composite DNA methylation score from 11 CpGs, which mediates 39% of the cumulative effect. In conclusion, we propose an unbiased estimator for the CIE. Future studies will investigate missingness in mediators and outcomes.
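
To make the bootstrap SE idea concrete, here is a deliberately simplified sketch: a single-time-point product-of-coefficients indirect effect estimated by least squares, with a nonparametric bootstrap for its standard error. It is not the cumulative longitudinal estimator described in the abstract; the function names, the simulated coefficients (0.5, 0.3, 0.8), and the number of bootstrap draws are assumptions chosen only for illustration.

```python
import numpy as np

def indirect_effect(x, m, y):
    """Product-of-coefficients estimate a*b from two least-squares fits."""
    ones = np.ones_like(x)
    # Mediator model: m ~ 1 + x, slope a
    a = np.linalg.lstsq(np.column_stack([ones, x]), m, rcond=None)[0][1]
    # Outcome model: y ~ 1 + x + m, coefficient b on the mediator
    b = np.linalg.lstsq(np.column_stack([ones, x, m]), y, rcond=None)[0][2]
    return float(a * b)

def bootstrap_se(x, m, y, n_boot=500, seed=0):
    """Standard deviation of the estimate over resampled datasets."""
    rng = np.random.default_rng(seed)
    n = len(x)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample subjects with replacement
        draws.append(indirect_effect(x[idx], m[idx], y[idx]))
    return float(np.std(draws, ddof=1))

# Simulated data with a true indirect effect of 0.5 * 0.8 = 0.4 (assumed values).
rng = np.random.default_rng(42)
n = 1000
x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)
y = 0.3 * x + 0.8 * m + rng.normal(size=n)
estimate = indirect_effect(x, m, y)
se = bootstrap_se(x, m, y)
print(estimate, se)
```

The bootstrap needs no derivation beyond the estimator itself, which is the kind of implementation simplicity the abstract cites as its reason for recommending it.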

4
Unlocking Multi-Sample Differential Expression for Spatial Transcriptomics Data with TESSERA

Constantine, F.; Laszik, Z.; Dudoit, S.; Purdom, E.

2026-04-30 bioinformatics 10.64898/2026.04.27.720955 medRxiv
Top 0.1%
1.2%

Spatial transcriptomics allows the unprecedented examination of gene expression levels at the resolution of spatially-situated single cells in a high-throughput manner. As the technology is adopted more broadly, studies frequently collect data from multiple tissue samples, which leads to unique challenges that traditional spatial statistical methods are not equipped to handle. In particular, factors that differ across samples, such as different coordinate systems, different numbers and types of cells, different underlying tissue architectures, among others, preclude the application of traditional methods to our new setting. In this work, we propose a novel method, TESSERA, based on a spatial generalized linear model, for analyzing multi-sample spatial transcriptomics count data. Importantly, we provide a mathematical and computational framework for efficient and scalable model fitting and statistical inference to accompany the specification of our model. Our method for fitting the model enables the estimation of a common set of fixed effects across samples. This allows us to address a variety of differential expression questions, such as identification of which genes are differentially expressed between conditions (e.g., diseases, treatments), while accounting for spatial correlation between cells within a sample. We benchmark our proposed method on simulated data and apply it to a spatial transcriptomics dataset of human kidney samples. We find that our method provides a hitherto nonexistent extension to the multi-sample setting while remaining competitive with or outperforming existing algorithms in the single-sample setting.

5
The ATLAS Penalty: Auxiliary-Transformed Location-Aware Smoothing with Applications to Spatial Transcriptomics

Tang, Q.; Chi, E. C.; Wang, W.

2026-05-20 bioinformatics 10.64898/2026.05.18.725545 medRxiv
Top 0.1%
0.9%

We address the problem of fitting a collection of location-specific models under a spatial smoothness assumption. Existing approaches penalize roughness in the model parameters directly, an assumption that breaks down when smoothness is a function of parameters and auxiliary covariates rather than the parameters themselves. Our framework, the Auxiliary-Transformed Location-Aware Smoothing (ATLAS) penalty, generalizes spatial smoothness by penalizing roughness in transformations of model parameters using auxiliary information. As a concrete case study, we develop a spatially smooth deconvolution model for spatial transcriptomics that estimates tumor mixing coefficients from thousands of spots distributed on a single tissue slide. To handle the computational challenges posed by the nonlinear likelihood, nonsmooth nonconvex penalty, and spatially coupled estimation, we propose an alternating direction method of multipliers (ADMM) algorithm. Through simulation studies, we demonstrate that our framework provides substantially better spatial domain detection than approaches that smooth model parameters directly, with particularly strong gains when auxiliary covariates carry calibrated spatial structure.

6
Granger Sensori-Behavioral Taxonomy of Neuronal Ensemble Activity from Two-Photon Calcium Imaging Data

Khosravi, S.; Francis, N. A.; Kanold, P. O.; Babadi, B.

2026-05-15 neuroscience 10.64898/2026.05.12.724603 medRxiv
Top 0.1%
0.9%

Understanding how neuronal populations interact to encode and transform sensory information is a fundamental challenge in computational neuroscience. Most existing work, however, treats neural encoding, behavioral readout, and functional connectivity as disjoint problems. Two-photon calcium imaging enables simultaneous recording of large neuronal ensembles in vivo, driven by diverse stimuli and eliciting distinct behaviors. However, extracting directional functional connectivity metrics as well as encoding and readout properties of neurons from such data remains difficult due to indirect and noisy observations of spiking activity, slow temporal dynamics, and the latent interplay between external stimuli and endogenous neural processes. Here, we introduce a unified conceptual and operational modeling and inference framework for directly extracting functional Granger causal (GC) effects between neurons, from external stimuli to neurons, and from neurons to behavior from two-photon imaging data. Inspired by the intersection information framework, we also identify neurons that encode features of sensory stimuli that inform behavioral readout. The resulting GC networks, together with the taxonomy of functional sensori-behavioral relevance, which we call G-taxonomy, provide a powerful statistical analysis framework, enabled by the integration of several techniques including state-space modeling and inference, variational inference, and point processes. We applied the proposed framework to simulated and experimentally recorded two-photon imaging data from the mouse auditory cortex (A1) during both passive listening and active tone discrimination. Our simulation studies reveal significant improvement of our proposed methodology over existing techniques. Analysis of experimental data from the mouse A1 identifies distinct groups of cells with diverse sensori-behavioral relevance, as well as changes in functional connectivity associated with correct vs. incorrect behavior. In summary, this work provides a principled and data-driven methodology for uncovering directional interactions among the neurons, sensory stimuli, and behavior, all within the same statistical framework, offering new insights into how distributed cortical populations transform sensory inputs into behaviorally relevant representations. Author Summary: The brain processes sensory inputs through the coordinated activity of large networks of neurons and produces readouts that elicit behavior. Understanding how information flows and is processed through these networks is a central goal of neuroscience. In this study, we present a new computational framework that identifies directional interactions among neurons in an ensemble as well as from sensory stimuli to neurons and from neurons to behavior. Utilizing the Granger formalism to identify directional effects, as opposed to common correlational measures, our framework extracts these effects directly from two-photon calcium imaging data. We tested our proposed method on both simulated data and recordings from the auditory cortex of mice during passive listening and active tone discrimination tasks. Our method revealed diverse groups of neurons in the auditory cortex with distinct functional roles and relevance to sensori-behavioral integration. Our framework provides a new way to study the flow of information in the brain and can be broadly applied to uncover neural computations across sensory and cognitive systems.

7
Spurious correlation inflates performance in single-cell perturbation prediction

Nicol, P. B.; Shivakumar, S.; Irizarry, R.

2026-05-12 bioinformatics 10.64898/2026.05.07.723486 medRxiv
Top 0.1%
0.8%

The increasing number of computational methods designed to predict the effects of genetic perturbations on cellular gene expression profiles has led to a need for rigorous evaluation metrics. Recent benchmarking studies rely on correlation or cosine similarity of differential expression relative to a shared population of control cells. We show that these metrics are systematically inflated by statistical bias induced by reusing the same control population to define both quantities being compared. As a result, even non-informative methods can appear to perform well, particularly in datasets with limited numbers of control cells. Reanalysis of published datasets using a simple control-splitting procedure that removes this bias leads to a substantial reduction in performance previously attributed to biological signal.
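
The bias described in this abstract can be reproduced in a toy simulation. In the sketch below there is no true perturbation effect at all, so any correlation between "observed" and "predicted" differential expression is an artifact. The simulation design, the sample sizes, and the particular way the controls are split in half are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_control, n_pert = 2000, 20, 200
baseline = rng.normal(size=n_genes)  # per-gene mean shared by every group (no true effect)

def simulate_cells(n_cells):
    """Cells x genes matrix: shared baseline plus unit-variance noise."""
    return baseline + rng.normal(size=(n_cells, n_genes))

control = simulate_cells(n_control)
observed = simulate_cells(n_pert).mean(axis=0)   # measured perturbed profile
predicted = simulate_cells(n_pert).mean(axis=0)  # stand-in for a non-informative prediction

def pearson(u, v):
    return float(np.corrcoef(u, v)[0, 1])

# Shared reference: the same noisy control mean is subtracted from both vectors.
shared_ref = control.mean(axis=0)
r_shared = pearson(observed - shared_ref, predicted - shared_ref)

# Split reference: disjoint halves of the control cells define the two deltas.
half = n_control // 2
ref_a = control[:half].mean(axis=0)
ref_b = control[half:].mean(axis=0)
r_split = pearson(observed - ref_a, predicted - ref_b)

print(f"shared controls: r = {r_shared:.2f}; split controls: r = {r_split:.2f}")
```

With only 20 control cells, the noise in the control mean dominates both deltas, so the shared-reference correlation is high even though nothing is being predicted; with disjoint halves that common noise term disappears.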

8
Synthetic Data Generation and Nonparametric Techniques for Assessing Multivariate Similarity to Address Small-Sample Size Challenges

Heine, J.; Fowler, E.; Eschrich, S. A.; Schell, M.

2026-05-07 bioinformatics 10.64898/2026.05.04.722226 medRxiv
Top 0.1%
0.7%

Data modeling in biomedical research often operates in the small-sample regime, where the number of observations is small relative to the data dimensionality; the detrimental effects of limited sample sizes are well documented in cancer studies. Synthetic data offers a potential solution to data shortfalls provided that the generated data is an adequate facsimile of the underlying distribution; assessing the adequacy of such synthetic data remains an open problem. In this work, we evaluate a previously proposed synthetic generator. The generator applies a series of transformations to the observed data to accommodate the small sample size, resulting in an uncoupled representation in which uncorrelated marginal distributions are modeled with optimized univariate kernel density estimation. In this report, we (1) develop a nonparametric method for assessing multivariate similarity based on the Cramér-Wold theorem and random projection testing, (2) investigate when the absence of bivariate correlation approximates independence in a non-normal setting, and (3) evaluate artifacts induced by data compression. The presentation is primarily methodological; low-dimensional data were used so each stage of the generation process could be analyzed explicitly. A formal testing framework was developed by comparing random projection level outcomes with a two-sample test, modeling these outcomes as Bernoulli trials, aggregating replicate outcomes within each projection direction, and pooling outcomes across many directions, yielding a scalable standardized normal test statistic. The key innovation was decoupling the two-sample test significance level from that governing finalized normal inference. We showed the same projection framework also evaluates the full multivariate covariance structure. The generator produced high-fidelity multivariate synthetic data when the bivariate correlation approximates independence in the non-normal setting; in highly compressed data, residual modes were best modeled as normally distributed regardless of their intrinsic distributional form. Ongoing work includes applying these methods to higher-dimensional, diverse data.

9
A Zero-Inflated Hierarchical Generalized Transformation Model to Address Non-Normality in Spatially-Informed Cell-Type Deconvolution

Melton, H. J.; Bradley, J. R.; Wu, C.

2026-03-06 genomics 10.1101/2024.06.24.600480 medRxiv
Top 0.1%
0.7%

Oral squamous cell carcinomas (OSCC), the predominant head and neck cancer, pose significant challenges due to late-stage diagnoses and low five-year survival rates. Spatial transcriptomics offers a promising avenue to decipher the genetic intricacies of OSCC tumor microenvironments. In spatial transcriptomics, cell-type deconvolution is a crucial inferential goal; however, current methods fail to consider the high zero-inflation present in OSCC data. To address this, we develop a novel zero-inflated version of the hierarchical generalized transformation model (ZI-HGT) and apply it to the Conditional AutoRegressive Deconvolution (CARD) for cell-type deconvolution. The ZI-HGT serves as an auxiliary Bayesian technique for CARD, reconciling the highly zero-inflated OSCC spatial transcriptomics data with CARD's normality assumption. The combined ZI-HGT + CARD framework achieves enhanced cell-type deconvolution accuracy and quantifies uncertainty in the estimated cell-type proportions. We demonstrate the superior performance through simulations and analysis of the OSCC data. Furthermore, our approach enables the determination of the locations of the diverse fibroblast population in the tumor microenvironment, critical for understanding tumor growth and immunosuppression in OSCC.

10
Identifying Inheritance Patterns of Allelic Imbalance, using Integrative Modeling and Bayesian Inference

Hoyt, S. H.; Reddy, T. E.; Gordan, R.; Allen, A. S.; Majoros, W. H.

2026-03-31 bioinformatics 10.64898/2026.03.28.714974 medRxiv
Top 0.1%
0.7%

Interpreting the effects of novel mutations on phenotypic traits remains challenging, particularly for cis-regulatory variants. For rare variants, individuals typically possess at most one affected copy of the causal allele, leading to allelic imbalance, and thus the ability to infer inheritance of allelic imbalance can inform genetic studies of phenotypic traits. While many methods for detection of allele-specific expression (ASE) exist, they largely focus on ASE in one individual. We show that performing joint inference across multiple individuals in a trio allows for simultaneously improving estimates of ASE and identifying its likely mode of inheritance. Our Bayesian approach has the benefit of being able to (1) aggregate information across individuals so as to improve statistical power, (2) estimate uncertainty in estimates, and (3) rank modes of inheritance by posterior probability. We demonstrate that this model is also applicable to other forms of imbalance such as allele-specific chromatin accessibility. Applying the model to ATAC-seq and RNA-seq from several trios, we uncover examples in which ASE can be linked to imbalance in chromatin state of cis-regulatory elements and to potential causal variants. As the cost of sequencing continues to decrease, we expect that powerful methodologies such as the one presented here will promote more routine collection of samples from related individuals and improve our understanding of genetic effects on gene regulation and their contribution to phenotypic traits.

11
Analysis of biological networks using Krylov subspace trajectories

Frost, H. R.

2026-03-31 bioinformatics 10.64898/2026.03.29.715092 medRxiv
Top 0.1%
0.7%

We describe an approach for analyzing biological networks using rows of the Krylov subspace of the adjacency matrix. Specifically, we explore the scenario where the Krylov subspace matrix is computed via power iteration using a non-random and potentially non-uniform initial vector that captures a specific biological state or perturbation. In this case, the rows of the Krylov subspace matrix (i.e., Krylov trajectories) carry important functional information about the network nodes in the biological context represented by the initial vector. We demonstrate the utility of this approach for community detection and perturbation analysis using the C. elegans neural network.
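
A minimal sketch of the construction this abstract describes: run power iteration on an adjacency matrix from a chosen start vector, keep every iterate as a column, and read each row as one node's trajectory. The per-step normalization, the function name `krylov_trajectories`, and the five-node toy graph are assumptions for illustration; the author's exact scaling and downstream analysis may differ.

```python
import numpy as np

def krylov_trajectories(adjacency, start, n_steps):
    """Stack the normalized power-iteration iterates x, Ax, A^2 x, ... as columns.

    Row i of the result is the trajectory of node i across iterations.
    """
    x = np.asarray(start, dtype=float)
    x = x / np.linalg.norm(x)
    columns = [x]
    for _ in range(n_steps):
        x = adjacency @ x
        x = x / np.linalg.norm(x)  # rescale each step so iterates stay comparable
        columns.append(x)
    return np.column_stack(columns)  # shape: (n_nodes, n_steps + 1)

# Toy undirected graph: a triangle (0, 1, 2) with a tail 2-3-4.
A = np.zeros((5, 5))
for i, j in [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]:
    A[i, j] = A[j, i] = 1.0

start = np.array([1.0, 0.0, 0.0, 0.0, 0.0])  # "perturbation" placed on node 0
K = krylov_trajectories(A, start, n_steps=60)
print(K[:, :4].round(3))  # early iterates differ by node; late ones settle down
```

On a connected, non-bipartite graph like this one, the late columns approach the leading eigenvector regardless of the start vector, so the start-specific information sits in the early part of each row.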

12
Omitted familial extrinsic risk inflates inferred intrinsic lifespan heritability

Kornilov, S. A.

2026-04-06 genetics 10.64898/2026.04.02.716222 medRxiv
Top 0.1%
0.7%

Shenhar et al. (2026) report 50% "intrinsic" lifespan heritability after calibrating a one-component correlated-frailty survival model to Scandinavian twin lifespans. Their framework is mathematically coherent, but the intrinsic component is not identified if heritable, mortality-relevant extrinsic susceptibility is omitted at calibration. We show that one-component calibration absorbs omitted familial extrinsic structure into the intrinsic frailty scale parameter σ_θ, and that this variance absorption is visible through separate diagnostics. (1) Variance absorption. Under misspecification, σ_θ is inflated by +22.1% (95% CI: 21.5-22.7%), corresponding to +49% inflation in [Formula]. Falconer h² is downstream of calibration and inherits a +9.2 pp bias (95% CI: 8.7-9.7). The σ_θ inflation is model-general: +22% (GM), +18% (MGG), +14% (SR); any dependence summary that is strictly increasing in σ_θ inherits this inflation, so Falconer h² is one affected downstream quantity among many (Corollary B3). (2) Structural fingerprint. In the joint twin survival surface S(t1, t2), misspecification produces systematic dependence errors (ISE 48x that of the recovery model). Conditional twin dependence is inflated at all ages, peaking at age 80 (Δr = 0.048). (3) Specificity. The bias requires an omitted component that is both heritable and mortality-relevant. Three negative controls, a boundary check (ρ = 0), and a two-component recovery refit (σ_θ restored to within -3.2%) establish specificity. ACE decomposition yields C ≈ 0 throughout: the omitted extrinsic component loads onto A (because it is shared 1.0/0.5 in MZ/DZ), so switching summary statistics does not restore identification. (4) Sensitivity and falsifiability. Over an empirically anchored regime (σ_γ ∈ [0.30, 0.65], ρ ∈ [0.20, 0.50]), Falconer bias ranges from +2.8 to +18.9 pp (mean 9 pp). If ρ is sufficiently negative, the bias reverses sign in all three model families (Corollary B4). A full-likelihood robustness check shows that this upward pull is partly structural and partly estimator-specific: in the same misspecified one-component model, ML still inflates σ_θ (+3%), whereas matching only r_MZ inflates it much more (+21%). These results do not resolve true intrinsic heritability but establish that Shenhar et al.'s 50% estimate carries a structured, model-general upward bias originating in the fitted latent variance σ_θ.

13
Robust Inference of Individualized Treatment Effect in Mendelian Randomization

Liang, M.; Wu, R.; Xiao, F.; Li, X.

2026-05-12 genetics 10.64898/2026.05.08.723855 medRxiv
Top 0.1%
0.6%

Mendelian randomization (MR) is widely used to draw causal conclusions in the presence of unmeasured confounding, but most MR analyses focus on average treatment effects and rely on strong assumptions. For precision medicine, the primary target is instead the individualized treatment effect (ITE); yet in MR, such effects are not point-identified under core IV assumptions, and valid inference is particularly challenging. We therefore propose a robust partial identification inference framework for ITE under MR allowing multiple instruments. Under minimal causal assumptions, we derive a sharp inference procedure for the intersection bounds of ITE by adopting a multiplier bootstrap procedure with data-adaptive bootstrap distribution shifting and heterogeneous variance adjustment. In theory, we prove that the proposed method achieves nominal coverage and asymptotic sharpness. Further, we extend the procedure to tolerate possible invalid IVs under a minimal proportion rule assumption by aggregating over instrument subsets while preserving coverage. Simulation studies demonstrate that the proposed methods attain nominal coverage and substantially shorter intervals than existing procedures. We illustrate the framework using data from the Alzheimer's Disease Neuroimaging Initiative to assess heterogeneous causal effects of TREM2 expression on Alzheimer's disease risk across education-defined subgroups.

14
Dissecting oligogenic and polygenic indirect genetic effects through the lens of neighbor genotypic identity

Sato, Y.; Hamazaki, K.

2026-04-03 genetics 10.64898/2026.03.31.715746 medRxiv
Top 0.1%
0.6%

Individual phenotypes often depend on the genotypes of other individuals within a group. These phenomena are termed indirect genetic effects (IGEs) and have been distinguished from direct genetic effects (DGEs) using quantitative genetic models. Recent studies have utilized high-resolution polymorphism data to enable genomic prediction (GP) and genome-wide association study (GWAS) of IGEs, but unified methods remain limited. Here we integrate polygenic and oligogenic IGEs using a multi-kernel mixed model incorporating two random effects with a single covariance parameter. Underlying this implementation, the Ising model of ferromagnetics enabled us to simplify locus-wise and background IGEs for GWAS and GP, respectively. Our simulations demonstrated that, while the previous and present models exhibited similar performance, the present model can infer a trade-off between DGEs and IGEs. By applying this method to three species of woody plants, we found evidence for intergenotypic competition in aspen and apple trees, but limited evidence in climbing grapevines. Based on GWAS, we also detected significant variants associated with the competitive IGEs on the apple trunk growth. Our study offers a flexible implementation for GWAS/GP of IGEs, thereby providing an effective tool to dissect the genetic architecture of group performance.

15
Simpler is not always better: Phylodynamic misspecification and deep-learning corrections

Xie, R.; Gascuel, O.; Zhukova, A.

2026-05-08 epidemiology 10.64898/2026.05.07.26352661 medRxiv
Top 0.1%
0.5%

Phylodynamics bridges the gap between epidemiology and pathogen genetic data by estimating epidemiological parameters from time-scaled pathogen phylogenies. Multi-type birth-death (MTBD) models are phylodynamic analogues of compartmental models in classical epidemiology. They serve to infer the average number of secondary infections R and the infection duration d. More complex MTBD models add extra parameters, such as the average length of the incubation period or the proportion of superspreaders in the infected population. However, these additional parameters come at an important computational cost: apart from the simplest (BD) model, MTBD models do not have a closed-form solution and require numerical methods for their likelihood computation. This leads to increased computational times and potential numerical errors. Therefore, the BD model remains researchers' favorite choice for real dataset analyses, and is often applied even in cases where more complex epidemiological aspects are present. We investigated, using simulations, how model misspecification influences inference of R and d in the phylodynamic framework. We showed that the use of models not accounting for various epidemiological aspects leads to bias. In particular, the simplest (BD) estimator tends to underestimate R in the presence of super-spreading or incubation, which might be dangerous from the public health perspective. However, deep-learning-based estimators for complex models, which account for multiple epidemiological factors, perform well both on data where those factors are present and on data where they are absent. This advocates for the use of complex, epidemiologically realistic estimators, whose design has recently become possible thanks to deep learning.

16
A biologically annotated neural network for proteomic discovery in Parkinson's disease

Vijayaraghavan, A.; Crawford, L.; Krishnakant, S.; Amini, A. P.; Conard, A. M.; Olsen, A. L.; Chahine, L. M.; Severson, K. A.

2026-04-30 neurology 10.64898/2026.04.29.26351681 medRxiv
Top 0.2%
0.5%

Machine learning models that can utilize high-dimensional data to make predictions and derive biological insights can improve understanding of diseases. Here, we develop a biologically annotated neural network model for proteomics data (P-BANN) which has several practical advantages: (1) it incorporates known relationships between proteins and signaling pathways into its architecture design; (2) it uses Bayesian principles to enable variable selection on the most important proteins for a disease of interest; and (3) it combines structured and black-box variational inference to analyze different classes of phenotypes at scale. To demonstrate the value of the approach, we apply P-BANN to one of the most common neurodegenerative disorders: Parkinson's disease (PD). We consider two biomarker-defined phenotypes within the PD population: presence of neuronal-predominant aggregated α-synuclein in cerebrospinal fluid, and changes in dopamine transporter binding in the striatum on imaging. By considering biomarkers of both neuropathological hallmarks of PD, we can examine the extent to which their underlying biology is connected. Using the P-BANN framework, we discover sparse, statistically calibrated sets of proteins which map to pathways, enabling more straightforward interpretation and generation of testable hypotheses.

17
Anchored Brownian motion and Bayesian methods for the analysis of single particle tracking data

Sgouralis, I.; Boles, A.; Shelby, S.; Pyron, R.

2026-04-22 biophysics 10.64898/2026.04.20.719631 medRxiv
Top 0.2%
0.5%

We present a novel statistical method and a prototype computational implementation for estimating the diffusion coefficient from single particle tracking (SPT) data. Our method is based on anchored Brownian motion, a novel representation that relaxes the restrictions of conventional Brownian motion. Our method is fully developed in Bayesian terms and allows for robust estimation of the diffusion coefficient and quantification of the uncertainty propagated from limited data quantity and quality, as appropriate for the analysis of live-cell SPT data. We compare our methods with conventional Brownian motion and demonstrate superior performance in estimating the correct value of the diffusion coefficient. Finally, we benchmark our methods with SPT data from in cellulo and in silico experiments.

18
Robust identification of cell-cell communication heterogeneity in single cells

Bocci, F.; Jia, Y.; Atwood, S.; Nie, Q.

2026-05-04 bioinformatics 10.64898/2026.04.29.721691 medRxiv
Top 0.2%
0.5%

Communication between cells modulates cell fate decisions by relaying information across tissues and inducing intracellular responses mediated by gene regulatory networks. Inference of cell-cell communication from high-throughput data such as single-cell transcriptomics is gaining popularity due to high data availability and the ease of automating modeling over hundreds of signaling pathways. How cell-cell communication operates across biological scales and influences cell fate decisions, however, remains a major open question. Here, we present scRICH, a framework and package that integrates mechanism-based, multiscale mathematical modeling with learning strategies to capture the complexity of cell-cell communication from single-cell and spatial transcriptomics data. scRICH unravels the heterogeneity of communication behavior within cell types, links cell-cell communication to cell fate decisions by incorporating dynamical information of RNA splicing, and connects the scales of cell-cell interactions and intracellular response by constructing multilayer regulatory networks. We validate scRICH with new experiments on EGF ligand/receptor co-expression in keratinocytes from skin-equivalent organoids, and compare these computational predictions against existing CCC inference methods. Applying scRICH to multiple biological scenarios demonstrates its ability to capture emerging relations between distinct cell-cell communication pathways, interactions at the onset of cell fate decision, and emerging trends in cell-cell communications along cell lineages and in space.

19
Beyond single-slope Mendelian randomization: structural representation of genetic heterogeneity in joint effect space

Hao, H.; Chen, D.; Qian, C.; Zhou, X.; Huang, H.; Zuo, J.; Wang, G.; Peng, X.; Liu, H.-X.

2026-03-14 genetic and genomic medicine 10.64898/2026.03.12.26348288 medRxiv
Top 0.2%
0.5%

Causal effects in complex traits are typically represented by a single linear slope. While conventional Mendelian randomization (MR) provides efficient scalar estimates, projection-based summaries do not explicitly capture structural organisation in joint effect space under genetic heterogeneity. We introduce MR-UBRA (Mendelian randomization-Unified Bayesian Risk Architecture), a probabilistic framework that decomposes instrumental variants into genetic risk fragments (GRFs) and quantifies extreme deviations using tail-risk metrics defined on the standardised residual magnitude |e|. MR-UBRA preserves the classical MR estimand while offering a structurally resolved representation of genetic heterogeneity. Across stroke subtypes, AF→CES, smoking→lung cancer, and BMI→T2D, effect-space distributions exhibit reproducible asymmetry, amplitude stratification, and multi-modal structure. MR-UBRA resolves component-level organisation, separating tail-dominant contributions from the causal core while maintaining consistency with the classical MR estimand. Simulations and boundary regimes demonstrate adaptive model complexity: MR-UBRA selects K>1 when multi-component structure is present and collapses to K=1 under homogeneous conditions, avoiding spurious stratification. These results support viewing causal effects in complex traits as structured distributions in joint effect space, enhancing causal representation without altering the MR estimand.
Graphical Abstract: Mendelian randomization (MR) typically represents causal effects with a single linear slope. Under genetic heterogeneity, instrumental effects in joint (β_X, β_Y) space may exhibit multi-component structure and amplitude stratification that cannot be captured by a scalar summary. MR-UBRA fits a standard error-weighted mixture model to decompose instruments into genetic risk fragments (GRFs), estimates GRF-specific effects using posterior-weighted soft-IVW, and quantifies extreme deviations through tail-risk metrics (VaR/CVaR). Across empirical analyses and boundary regimes, MR-UBRA adapts model complexity (K) to structural signal, collapsing to K=1 under homogeneous conditions. [Figure 1: graphical abstract]
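
As a rough illustration of the tail-risk metrics named in this abstract, the sketch below computes an empirical VaR and CVaR on the residual magnitude |e|. It is not the MR-UBRA implementation: the function name `tail_risk`, the level `alpha`, and the quantile and tail-mean conventions are assumptions made for this example, and the abstract does not say how the authors compute these quantities.

```python
import math


def tail_risk(residuals, alpha=0.95):
    """Empirical VaR and CVaR of |e| at level alpha (hypothetical helper).

    Assumed conventions, not taken from the abstract:
    - VaR is the smallest observed magnitude whose empirical CDF is >= alpha.
    - CVaR is the mean of all magnitudes at or beyond VaR.
    """
    mags = sorted(abs(e) for e in residuals)
    if not mags:
        raise ValueError("need at least one residual")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")

    # Index of the empirical alpha-quantile in the sorted magnitudes.
    idx = max(math.ceil(alpha * len(mags)) - 1, 0)
    var = mags[idx]

    # Average the tail at or beyond VaR; the tail always contains VaR itself.
    tail = [m for m in mags if m >= var]
    cvar = sum(tail) / len(tail)
    return var, cvar


# Toy standardised residuals with one extreme instrument.
example = [0.1, -0.2, 0.3, -0.4, 0.5, -0.6, 0.7, -0.8, 0.9, -5.0]
var_50, cvar_50 = tail_risk(example, alpha=0.5)
```

By construction CVaR is never smaller than VaR, so a large gap between the two suggests that a few instruments dominate the tail.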

20
A Beta-Binomial Model for Estimating Zero- or One-inflated Pain Trajectories

Liu, Y.; Harris, R. E.; Clauw, D.; Bayman, E.; Leroux, A.; Lindquist, M. A.

2026-05-11 bioinformatics 10.64898/2026.05.07.721507 medRxiv
Top 0.2%
0.4%

Chronic pain is a widespread public health issue that imposes substantial health, emotional, and economic burdens on individuals and communities. Because pain is subjective and lacks objective biomarkers, it is typically measured using patient-reported scores, often on a numerical scale from zero to ten. Increasingly, pain studies use ecological momentary assessment, with multiple daily assessments over days and across study phases (e.g., a series of baseline and post-intervention assessments). These data frequently show many ratings at the extremes (i.e., at minimum or maximum pain scores), commonly referred to as zero- and one-inflation in the statistical literature, along with considerable within-person variability both within and across days. These phenomena present challenges for statistical analyses, as they violate assumptions of most commonly used statistical techniques (e.g., the normality assumption of linear mixed models). We propose a Bayesian beta-binomial mixed-effects model for potentially zero- or one-inflated pain scores that accounts for variability using random effects on the mean and variance parameters across subjects. A simulation study demonstrates that the method accurately estimates model parameters across realistic sample sizes, time points, and zero- and one-inflation levels. An application to data from two longitudinal pain studies demonstrates that the model fits the data better and, when correctly specified, yields accurate uncertainty intervals for longitudinal changes in pain compared to existing models, especially for zero- and one-inflated outcomes. Additionally, the model directly estimates the probability of clinically meaningful pain events. The proposed method provides a powerful statistical framework for studying patient-reported pain trajectories.
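
To make the idea of zero- and one-inflation concrete, the sketch below writes out a generic inflated beta-binomial probability mass function for a score k on a 0 to n scale. It is only an illustration under stated assumptions: the mean/precision parameterisation (`mu`, `phi`), the mixture weights `pi0` and `pi1`, and all function names are hypothetical, and the random effects and Bayesian fitting described in the abstract are omitted.

```python
import math


def log_beta(a, b):
    """Log of the Beta function, computed via log-gamma for stability."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def betabinom_pmf(k, n, mu, phi):
    """Beta-binomial pmf with mean proportion mu in (0, 1) and precision phi > 0.

    Assumed parameterisation: shape parameters a = mu * phi, b = (1 - mu) * phi.
    """
    a, b = mu * phi, (1.0 - mu) * phi
    log_p = (math.log(math.comb(n, k))
             + log_beta(k + a, n - k + b)
             - log_beta(a, b))
    return math.exp(log_p)


def inflated_pmf(k, n, mu, phi, pi0=0.0, pi1=0.0):
    """Mixture of a point mass at 0 (weight pi0), a point mass at n
    (weight pi1), and a beta-binomial with the remaining weight."""
    if pi0 < 0 or pi1 < 0 or pi0 + pi1 > 1:
        raise ValueError("pi0 and pi1 must be non-negative and sum to at most 1")
    p = (1.0 - pi0 - pi1) * betabinom_pmf(k, n, mu, phi)
    if k == 0:
        p += pi0  # extra mass at the minimum score
    if k == n:
        p += pi1  # extra mass at the maximum score
    return p


# Probabilities for a 0-10 pain score with 20% extra zeros and 5% extra tens.
probs = [inflated_pmf(k, 10, mu=0.4, phi=5.0, pi0=0.2, pi1=0.05) for k in range(11)]
```

Here "one-inflation" is read as extra mass at the maximum score n, matching the abstract's description of ratings at the minimum or maximum of the scale.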